feat: Helm umbrella chart for observability-stack#2

Closed
kylehounslow wants to merge 65 commits into main from feat/helm-charts

@kylehounslow

Owner

Summary

Adds charts/observability-stack/ — a Helm umbrella chart that deploys the full observability stack on Kubernetes, mirroring the existing docker-compose setup.

Components (all upstream charts as dependencies)

| Component | Chart Source | Version |
|---|---|---|
| OpenSearch | opensearch-project/helm-charts | 3.5.0 |
| OpenSearch Dashboards | opensearch-project/helm-charts | 3.5.0 |
| Data Prepper | opensearch-project/helm-charts | 0.3.1 (image overridden to 2.15.0-SNAPSHOT) |
| OTel Collector | open-telemetry/opentelemetry-helm-charts | 0.147.0 |
| Prometheus | prometheus-community/helm-charts | 28.13.0 |

What's included

  • Umbrella Chart.yaml with all 5 dependencies
  • values.yaml translating docker-compose config to Helm values
  • Init job (post-install hook) — creates workspace, index patterns, dashboards, saved queries, correlations, Prometheus datasource
  • Prometheus remote-write sink on service-map pipeline for RED metrics
  • NOTES.txt with post-install connection info
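
For orientation, the dependency block of the umbrella Chart.yaml might look roughly like the sketch below. Chart names and versions are taken from the table above; the chart's own `version`, the `description`, and the repository URLs are assumptions and may not match what this PR ships.

```yaml
# charts/observability-stack/Chart.yaml (illustrative sketch, not the file from this PR)
apiVersion: v2
name: observability-stack
description: Umbrella chart for the observability stack  # assumed wording
type: application
version: 0.1.0  # assumed; the PR does not state the chart version
dependencies:
  - name: opensearch
    version: 3.5.0
    repository: https://opensearch-project.github.io/helm-charts/
  - name: opensearch-dashboards
    version: 3.5.0
    repository: https://opensearch-project.github.io/helm-charts/
  - name: data-prepper
    version: 0.3.1
    repository: https://opensearch-project.github.io/helm-charts/
  - name: opentelemetry-collector
    version: 0.147.0
    repository: https://open-telemetry.github.io/opentelemetry-helm-charts
  - name: prometheus
    version: 28.13.0
    repository: https://prometheus-community.github.io/helm-charts
```

With a file like this in place, `helm dependency update charts/observability-stack` pulls the five subcharts into `charts/` before `helm template` or `helm install` is run.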

Validated on

  • macOS arm64, kind + finch
  • All 5 pods running, cluster green
  • End-to-end trace pipeline: curl → OTel Collector → Data Prepper → OpenSearch
  • Init job completes in ~30s, all saved objects created ✅
  • OSD UI accessible via port-forward ✅

Issues discovered and fixed during development

  1. OpenSearch 3.5+ requires OPENSEARCH_INITIAL_ADMIN_PASSWORD env var
  2. Data Prepper chart default (2.8.0) too old — lacks otlp source plugin
  3. Data Prepper needs explicit ssl: false + peer_forwarder.ssl: false
  4. OpenSearch chart creates service name opensearch-cluster-master (not {release}-opensearch)
  5. OSD config needs opensearch.username/opensearch.password explicitly
  6. Prometheus service port is 80 (not 9090)
  7. Prometheus sink is experimental — needs experimental.enabled_plugins in DP config

Follow-up items

  • Switch Data Prepper image to official release once opensearch-project/data-prepper#6595 (prometheus-sink auth support by @ps48) is merged and released. Currently using sgguruda62324/opensearch-data-prepper:2.15.0-SNAPSHOT custom build.
  • Centralize credentials (single values source instead of hardcoded in multiple places)
  • Templatize hardcoded service names
  • Example agent deployment templates
  • values-staging.yaml for opensearchstaging bleeding-edge images
  • Ingress template for public-facing deployment
  • README with quickstart (kind + finch instructions)
  • EKS deployment for KubeCon 2026 EU demo

@kylehounslow force-pushed the feat/helm-charts branch 3 times, most recently from 39a1d5f to e56f0fc (March 19, 2026 05:27)
Adds charts/observability-stack/ as an umbrella Helm chart using
upstream dependencies:
- opensearch 3.5.0
- opensearch-dashboards 3.5.0
- data-prepper 0.3.1
- opentelemetry-collector 0.147.0
- prometheus 28.13.0

values.yaml mirrors the existing docker-compose configuration.
All 5 dependencies resolve and helm template renders successfully.

Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com>
- Add OPENSEARCH_INITIAL_ADMIN_PASSWORD env var for OpenSearch 3.5+
- Override Data Prepper image to latest (chart default 2.8.0 lacks otlp source)
- Explicitly disable SSL on Data Prepper server and peer_forwarder
- Fix service name references for inter-component connectivity
- Use otel/opentelemetry-collector-contrib image (required by chart)

Validated: all 5 pods running on kind + finch (macOS arm64)

Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com>
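
A minimal sketch of how the fixes in this commit could be expressed in the umbrella values.yaml. In an umbrella chart, each top-level key is the name of a dependency. The password value is a placeholder, and the exact keys (`extraEnvs`, `image.tag`, `config`, `image.repository`) are assumed to match the subchart versions pinned in this PR; check each subchart's values.yaml before relying on them.

```yaml
# values.yaml fragment (illustrative; key names assumed from the upstream subcharts)
opensearch:
  extraEnvs:
    # Required by recent OpenSearch images; the value here is a placeholder
    - name: OPENSEARCH_INITIAL_ADMIN_PASSWORD
      value: "<admin-password>"

data-prepper:
  image:
    tag: "latest"  # chart default 2.8.0 lacks the otlp source
  config:
    data-prepper-config.yaml: |
      # Disable TLS on the server and on peer forwarding
      ssl: false
      peer_forwarder:
        ssl: false

opentelemetry-collector:
  image:
    # The chart requires an explicit image repository
    repository: otel/opentelemetry-collector-contrib
```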
OSD was failing to authenticate to OpenSearch (401 No Authorization header).
Added opensearch.username and opensearch.password to opensearch_dashboards.yml.

End-to-end pipeline validated:
  curl → OTel Collector → Data Prepper → OpenSearch ✅
  OSD UI accessible on port-forward ✅

TODO: centralize credentials via .env-style values (not hardcoded)

Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com>
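
The OSD fix above could look something like this in values.yaml. This is a sketch only: the `config` and `opensearchHosts` keys are assumed from the upstream opensearch-dashboards chart, the `https` scheme is an assumption, and the credentials are placeholders rather than the values used in this PR.

```yaml
# values.yaml fragment (illustrative)
opensearch-dashboards:
  # Service name created by the OpenSearch chart (see issue 4 in the summary)
  opensearchHosts: "https://opensearch-cluster-master:9200"  # scheme assumed
  config:
    opensearch_dashboards.yml: |
      # Without these, OSD sends no Authorization header and gets a 401
      opensearch.username: "<dashboards-user>"
      opensearch.password: "<dashboards-password>"
```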
Helm post-install/post-upgrade hook that runs the existing
init-opensearch-dashboards.py script as a K8s Job.

Creates: workspace, index patterns (logs/traces/service-map),
trace-to-logs correlation, APM config, agent observability
dashboard, overview dashboard, and saved queries.

Script patched to read BASE_URL from env var for K8s service names.

Validated: job completes in ~30s, all saved objects created.

Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com>
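
A rough outline of such a hook Job, for readers unfamiliar with Helm hooks. Only the hook types, the script name, and the `BASE_URL` variable come from this commit; the resource names, the image, the delete policy, the ConfigMap, and the service URL are hypothetical.

```yaml
# templates/init-dashboards-job.yaml (illustrative sketch; names are hypothetical)
apiVersion: batch/v1
kind: Job
metadata:
  name: init-dashboards
  annotations:
    # Run after the release's resources are installed or upgraded
    "helm.sh/hook": post-install,post-upgrade
    # Replace the previous hook Job instead of failing on a name clash (assumed policy)
    "helm.sh/hook-delete-policy": before-hook-creation
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: init
          image: python:3.12-slim  # assumed image
          command: ["python", "/config/init-opensearch-dashboards.py"]
          env:
            # The script reads the Dashboards URL from this variable
            - name: BASE_URL
              value: "http://<release>-opensearch-dashboards:5601"  # placeholder
          volumeMounts:
            - name: config
              mountPath: /config
      volumes:
        - name: config
          configMap:
            name: init-dashboards-config  # hypothetical ConfigMap holding the script
```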
- Image: sgguruda62324/opensearch-data-prepper:2.15.0-SNAPSHOT
  (matches docker-compose .env, includes ps48's prometheus auth PR #6595)
- Correct experimental plugin syntax for DP 2.15
- Re-added prometheus remote-write sink to service-map pipeline
- All pipelines initialized including RED metrics to Prometheus

Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com>
- Add opensearch-credentials Secret template
- Init job references secret via secretKeyRef instead of hardcoded values
- Document that DP pipeline configs still need manual password sync
  (subchart values don't support Go templating)

Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com>
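
The Secret and its consumption via `secretKeyRef` might be shaped as follows. The Secret name is from this commit; the key names, the environment variable names, and the placeholder values are assumptions.

```yaml
# Illustrative sketch; key and variable names are assumed
apiVersion: v1
kind: Secret
metadata:
  name: opensearch-credentials
type: Opaque
stringData:
  username: "<opensearch-user>"
  password: "<opensearch-password>"
---
# Fragment of the init Job's container spec (not a complete manifest)
env:
  - name: OPENSEARCH_USER  # hypothetical variable name
    valueFrom:
      secretKeyRef:
        name: opensearch-credentials
        key: username
  - name: OPENSEARCH_PASSWORD  # hypothetical variable name
    valueFrom:
      secretKeyRef:
        name: opensearch-credentials
        key: password
```

The Data Prepper pipeline configs cannot use the same mechanism here because they are plain subchart values, which is why the password still has to be kept in sync by hand.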
CronJob runs every 2 minutes, sends 5 agent traces per run with
realistic GenAI semantic convention attributes:
- invoke_agent spans with gen_ai.agent.name
- chat spans with gen_ai.request.model, token usage, provider
- execute_tool spans with gen_ai.tool.name
- Randomized models (gpt-4o, claude-sonnet-4-20250514, nova-pro)

Validated: 20 spans indexed in OpenSearch from single canary run.

Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com>
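
A sketch of a CronJob with the two-minute schedule described above. The schedule is from this commit; the resource name, image, collector endpoint, and the `TRACES_PER_RUN` variable are hypothetical.

```yaml
# templates/canary-cronjob.yaml (illustrative sketch)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: trace-canary  # hypothetical name
spec:
  schedule: "*/2 * * * *"    # every 2 minutes
  concurrencyPolicy: Forbid  # assumed: skip a run if the previous one is still going
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: canary
              image: "<canary-image>"  # placeholder
              env:
                # Standard OTel SDK variable; the service name is a placeholder
                - name: OTEL_EXPORTER_OTLP_ENDPOINT
                  value: "http://<otel-collector-service>:4317"
                - name: TRACES_PER_RUN  # hypothetical knob for "5 agent traces per run"
                  value: "5"
```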
Docker-compose .env uses opensearchstaging/opensearch:3.6.0 and
opensearchstaging/opensearch-dashboards:3.6.0. Helm chart was
using the official 3.5.0 images which are significantly behind.

Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com>
Missing explore, agentTraces, discoverTraces, discoverMetrics,
query enhancements, new home page, and experimental features.
Config now matches docker-compose opensearch_dashboards.yml.

Plugins now loading: explore, agentTraces, observabilityDashboards,
queryEnhancements, datasetManagement (54 total).

Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com>
1:1 parity with docker-compose.examples.yml:
- example-weather-agent (FastAPI + OTel instrumented)
- example-events-agent
- example-travel-planner (orchestrator)
- example-mcp-server (mock tool server)
- example-canary (periodic invocations with fault injection)

All services, env vars, ports, and memory limits match compose.
Images built locally and loaded into kind via finch save/load.

Validated: all 5 agents running, canary invoking travel-planner
with fault injection, traces flowing to OpenSearch.

Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com>
- Gateway + HTTPRoute templates (replaces legacy Ingress)
- Two supported providers: envoy (Envoy Gateway), aws (VPC Lattice)
- Envoy: TLS via K8s secret (cert-manager or manual)
- AWS: TLS via ACM certificate annotation
- Disabled by default (gateway.enabled: false)
- Contributors can add GCP/Azure support

Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com>
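
For the Envoy case, the rendered resources would presumably resemble the pair below. This is a generic Gateway API sketch, not the chart's templates: the resource names, the gateway class, the TLS secret name, and the backend service name are all placeholders.

```yaml
# Illustrative Gateway API resources (names are hypothetical)
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: observability-gateway
spec:
  gatewayClassName: "<gateway-class>"  # depends on the installed controller
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      tls:
        mode: Terminate
        certificateRefs:
          - kind: Secret
            name: osd-tls  # created by cert-manager or manually
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: opensearch-dashboards
spec:
  parentRefs:
    - name: observability-gateway
  rules:
    - backendRefs:
        - name: "<release>-opensearch-dashboards"  # placeholder service name
          port: 5601
```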
Signed-off-by: Kyle Hounslow <kylhouns@amazon.com>
Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com>
Signed-off-by: Kyle Hounslow <kylhouns@amazon.com>
Init job was hardcoding 'opensearch:9200' but the actual service is
'opensearch-cluster-master:9200'. Pass OPENSEARCH_ENDPOINT env var
from the job template.

Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com>
Signed-off-by: Kyle Hounslow <kylhouns@amazon.com>
Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com>
Signed-off-by: Kyle Hounslow <kylhouns@amazon.com>
- Add saved-queries-traces.yaml and saved-queries-metrics.yaml to chart
- Add architecture.png as binaryData in ConfigMap
- Mount all files to /config so init script can find them
- Update overview dashboard on every run (not skip if exists)
- 20 saved queries now load, architecture image embedded in dashboard

Signed-off-by: Kyle Hounslow <kylehounslow@gmail.com>
Signed-off-by: Kyle Hounslow <kylhouns@amazon.com>
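
The `binaryData` approach mentioned above can be pictured as in this sketch. The file names come from this commit; the ConfigMap name and the placeholder contents are assumptions. In a Helm template the base64 string would typically be produced with `.Files.Get` piped through `b64enc` rather than pasted inline.

```yaml
# Illustrative ConfigMap; name and contents are placeholders
apiVersion: v1
kind: ConfigMap
metadata:
  name: init-dashboards-config
data:
  # Text files go under data
  saved-queries-traces.yaml: |
    # saved query definitions for traces
  saved-queries-metrics.yaml: |
    # saved query definitions for metrics
binaryData:
  # Binary files must be base64-encoded and go under binaryData
  architecture.png: "<base64-encoded PNG bytes>"
```

Mounting this ConfigMap at `/config` exposes all three entries as files, with the PNG decoded back to its original bytes.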
…quest, OTel CPU, spans dropped, Prometheus query latency)
…of truth) to helm, preserve K8s dashboard call
…ect#107 (adds Data Prepper panels, fixes prometheus panels)
- 29 tests across 6 suites covering all custom templates
- credentials, examples, gateway, init-dashboards, opensearch-exporter
- Tests conditional rendering, custom values, labels, annotations
- Wrapper script runs helm lint + helm unittest
- Requires helm-unittest plugin
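
To give a sense of what one of these suites might contain, here is a small helm-unittest sketch for the gateway template. The `gateway.enabled` toggle is from this PR; the file paths and the assumption that the first rendered document is the Gateway are hypothetical.

```yaml
# tests/gateway_test.yaml (illustrative; paths are hypothetical)
suite: gateway
templates:
  - templates/gateway.yaml
tests:
  - it: renders nothing by default
    asserts:
      - hasDocuments:
          count: 0

  - it: renders a Gateway when enabled
    set:
      gateway.enabled: true
    documentIndex: 0  # assumes the Gateway is the first document in the template
    asserts:
      - isKind:
          of: Gateway
```

A suite like this is run with `helm unittest charts/observability-stack` once the plugin is installed.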
Runs on push/PR to main when charts/ or test/helm-test.sh change.
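
A workflow with those triggers could be sketched as follows. The path filters and the script name are from the line above; the workflow name, job name, runner, and action versions are assumptions.

```yaml
# .github/workflows/helm-test.yml (illustrative sketch)
name: helm-test
on:
  push:
    branches: [main]
    paths:
      - "charts/**"
      - "test/helm-test.sh"
  pull_request:
    branches: [main]
    paths:
      - "charts/**"
      - "test/helm-test.sh"
jobs:
  helm-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: azure/setup-helm@v4
      - name: Install helm-unittest plugin
        run: helm plugin install https://github.com/helm-unittest/helm-unittest
      - name: Run helm lint and helm unittest
        run: ./test/helm-test.sh
```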
kylehounslow and others added 26 commits March 20, 2026 10:56
Matches upstream canary changes (shallow/normal/deep trace shapes).
Port anonymous auth from docker-compose to Helm/K8s deployment.
Closes #5.

Changes:
- Add anonymousAuth.enabled toggle in values.yaml (default: false)
- Create opensearch-security-config Secret with config.yml, roles.yml,
  roles_mapping.yml — anonymous_auth_enabled templated from values
- Update OpenSearch Dashboards config with anonymous_auth_enabled and
  conditional savedObjects.permission.enabled via global values + tpl
- Sync init script with docker-compose version (ANONYMOUS_AUTH_ENABLED
  env var, conditional anonymous role in workspace allowedRoles)
- Pass OPENSEARCH_ANONYMOUS_AUTH_ENABLED env var to init-dashboards Job
- Wire up Terraform anonymous_auth variable to Helm release
- Add 6 helm-unittest tests covering both enabled/disabled states
- Document usage in chart README

Usage:
  helm install obs-stack charts/observability-stack \
    --set anonymousAuth.enabled=true \
    --set global.anonymousAuth.enabled=true

Kiro/claude on behalf of @kylehounslow
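
The same two settings can also be kept in a values file instead of `--set` flags. The file name below is hypothetical; the keys are the ones shown in the usage example above.

```yaml
# values-anonymous.yaml (hypothetical file name)
anonymousAuth:
  enabled: true  # read by the umbrella chart's own templates
global:
  anonymousAuth:
    enabled: true  # visible to subcharts, which only see their own values plus global
```

It would then be applied with `helm install obs-stack charts/observability-stack -f values-anonymous.yaml`.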
…adation

- Fixed k6 auth (manual Base64 header) and PPL query syntax
- Test 002: 300 VUs, 0% errors, p95=16ms — no stress
- Test 003: 1500 VUs, 0% errors, p95=2.28s — saturated
- Breaking point estimated between 500-700 VUs for good UX
… panels

- New saved-queries-self-monitoring.yaml: thread pool rejections, search latency,
  Prometheus query latency P99, Data Prepper buffer capacity, OTel dropped spans
- OpenSearch Health dashboard: added thread pool rejections, active searches,
  search latency, fetch rate panels
- Pipeline Health dashboard: added Prometheus range query latency P99 panel
- init-dashboards-configmap: include new saved queries file
- Terraform module spins up m5.xlarge in same VPC as EKS cluster
- k6 scripts hit OSD through ALB (real user path: TLS + WAF + ingress)
- api-queries-alb.js: PPL, search, PromQL, dashboard loads, service map
- run-remote.sh: upload scripts + run tests on EC2
- Previous tests via port-forward were bottlenecked by kubectl tunnel
…t 99% CPU is next

- OSD scaled 1×100m → 3×2CPU: median latency 3s → 824ms, 0% errors
- OpenSearch now the bottleneck: 99-100% CPU, search queue peak 34
- Hot threads: write/refresh contention from OTel Demo indexing
- p95=14.57s at 1000 VUs — need to scale OpenSearch next
…g applied

- OSD scaled to 3 replicas, 2 CPU / 2Gi (resolved OSD bottleneck)
- Documented 3 OpenSearch scaling options: horizontal, dedicated search nodes, vertical
- Official approach: separate index/search with remote store + search replicas
- Recommended: start with 3 data nodes (Option A), simplest path
- singleNode: false, replicas: 3
- JVM heap: 1g → 2g (50% of 4Gi RAM)
- CPU: 500m req / 2000m limit
- EKS scaled to 4 nodes to fit the cluster
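
Expressed as umbrella values, the scaling settings above might read as in this sketch. The numbers are from this commit; the key names are assumed from the upstream opensearch chart, and setting the memory request equal to the 4Gi limit is an assumption.

```yaml
# values.yaml fragment (illustrative; key names assumed from the opensearch subchart)
opensearch:
  singleNode: false
  replicas: 3
  # 2g heap is 50% of the 4Gi container memory
  opensearchJavaOpts: "-Xms2g -Xmx2g"
  resources:
    requests:
      cpu: 500m
      memory: 4Gi  # assumed equal to the limit
    limits:
      cpu: 2000m
      memory: 4Gi
```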
…and scaling recommendations

- Estimated concurrent user capacity by experience tier
- 7-day and 30-day data volume projections
- Scaling recommendations by user count with cost estimates
- Load test history summary with key findings
- Tracks what hasn't been tested yet
…lity

- Exact commands for uploading scripts, running tests, monitoring, retrieving results
- Current deployment state and access points
- k6 script details and known issues
- Key learnings and gotchas discovered during testing
- File structure and next steps
…s (was 143)

- number_of_replicas=2 gives every node a copy of every shard
- Node-2 went from 4k to 53k queries (12.6x improvement)
- 62% throughput improvement over single-node baseline
- Remaining bottleneck: primary shard routing preference on Node-0
feat: Add anonymous authentication support to Helm chart
…rsistent config)

- Ingress: HTTPS/443 with ACM cert, TLS 1.3, external-dns hostname
- Health check: /app/login (unauthenticated, returns 200)
- OSD: replicaCount (not replicas) — correct subchart key
- OpenSearch: 4 CPU limit, 2 CPU request, 4Gi RAM, 2Gi JVM
- Prometheus: 50Gi PV, 2Gi/4Gi memory, 500m/1000m CPU
- OTel Demo: enabled in values.yaml
- preference=_replica in k6 search queries

Lesson: never use helm --reset-values or --set for config that should persist
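
One way to act on that lesson is to keep the sizing in a values file that is passed on every upgrade. The sketch below covers the OSD and Prometheus settings listed above; the key names are assumed from the upstream subcharts, and reading "2Gi/4Gi" and "500m/1000m" as request/limit pairs is an interpretation.

```yaml
# values.yaml fragment (illustrative; persists across helm upgrade when passed with -f)
opensearch-dashboards:
  replicaCount: 3  # the subchart key is replicaCount, not replicas

prometheus:
  server:
    persistentVolume:
      size: 50Gi
    resources:
      requests: {cpu: 500m, memory: 2Gi}   # assumed to be the request side
      limits: {cpu: 1000m, memory: 4Gi}    # assumed to be the limit side
```

Running `helm upgrade obs-stack charts/observability-stack -f values.yaml` with the same file each time avoids the drift that one-off `--set` flags introduce.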